
    Automatic coding of short text responses via clustering in educational assessment

    Automatic coding of short text responses opens new doors in assessment. We implemented and integrated baseline methods of natural language processing and statistical modelling by means of software components that are available under open licenses. The accuracy of automatic text coding is demonstrated using data collected in the Programme for International Student Assessment (PISA) 2012 in Germany. Free-text responses to 10 items, with Formula responses in total, were analyzed. We further examined the effect of different methods, parameter values, and sample sizes on the performance of the implemented system. The system reached fair to good, and up to excellent, agreement with human codings (Formula). In particular, items that are solved by naming specific semantic concepts turned out to be properly coded. The system performed equally well with Formula and somewhat more poorly, but still acceptably, down to Formula. Based on our findings, we discuss potential innovations for assessment that are enabled by automatic coding of short text responses. (DIPF/Orig.)
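
    The following is a minimal sketch of the general idea of cluster-based automatic coding, not the published pipeline: responses are vectorized, clustered, each cluster inherits the majority human code of its members, and agreement with the human codes is quantified with Cohen's kappa. The toy data, cluster count, and library choice (scikit-learn) are illustrative assumptions; the actual system relies on richer preprocessing and semantic spaces.

        # Hedged sketch: cluster-based coding of short text responses (toy data).
        from collections import Counter

        from sklearn.cluster import KMeans
        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.metrics import cohen_kappa_score

        responses = [
            "the sun is at the centre of the solar system",
            "planets orbit the sun",
            "the earth is the centre",
            "everything moves around the earth",
            "the moon orbits the earth",
            "gravity keeps the planets in orbit around the sun",
            "the sun moves around the earth",
            "planets circle the sun because of gravity",
        ]
        human_codes = [1, 1, 0, 0, 0, 1, 0, 1]  # hypothetical human codings

        # Vectorize the responses and group them into clusters.
        X = TfidfVectorizer().fit_transform(responses)
        kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

        # Each cluster is assigned the majority human code of its members.
        cluster_code = {
            c: Counter(code for code, label in zip(human_codes, kmeans.labels_)
                       if label == c).most_common(1)[0][0]
            for c in range(kmeans.n_clusters)
        }
        automatic_codes = [cluster_code[label] for label in kmeans.labels_]

        # Agreement with human codings; a real evaluation would use held-out responses.
        print("Cohen's kappa:", cohen_kappa_score(human_codes, automatic_codes))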

    shinyReCoR. A Shiny application for automatically coding text responses using R

    In this paper, we introduce shinyReCoR: a new app that utilizes a cluster-based method for automatically coding open-ended text responses. Reliable coding of text responses from educational or psychological assessments requires substantial organizational and human effort. The coding of natural language in test responses depends on the texts' complexity, the corresponding coding guides, and the guides' quality. Manual coding is thus not only expensive but also error-prone. With shinyReCoR, we provide a more efficient alternative. The use of natural language processing makes texts utilizable for statistical methods. shinyReCoR is a Shiny app deployed as an R package that allows users with varying levels of technical expertise to create automatic response classifiers through a graphical user interface based on annotated data. The present paper describes the underlying methodology, including machine learning, as well as the peculiarities of processing language in the assessment context. The app guides users through the workflow with steps such as text corpus compilation, semantic space building, preprocessing of the text data, and clustering. Users can adjust each step according to their needs. Finally, users are provided with an automatic response classifier, which can be evaluated and tested within the process. (DIPF/Orig.)
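
    As a rough illustration of the semantic-space and clustering steps described above, the sketch below embeds a toy corpus in a reduced vector space via truncated SVD (a plain LSA stand-in for the semantic spaces shinyReCoR builds in R) and assigns a new response to its nearest cluster. Corpus, dimensionality, and cluster count are assumptions for illustration only.

        # Hedged sketch: embed responses in a reduced "semantic space", cluster them,
        # and place a new response into the nearest cluster.
        from sklearn.cluster import KMeans
        from sklearn.decomposition import TruncatedSVD
        from sklearn.feature_extraction.text import TfidfVectorizer

        corpus = [
            "water evaporates because of the heat",
            "the heat makes the water evaporate",
            "the water disappears into the ground",
            "it soaks into the soil",
            "the sun warms the puddle and it evaporates",
            "the ground absorbs the water",
        ]

        vectorizer = TfidfVectorizer()
        svd = TruncatedSVD(n_components=2, random_state=0)
        space = svd.fit_transform(vectorizer.fit_transform(corpus))  # response coordinates

        clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit(space)

        new_response = ["the warmth of the sun evaporates the water"]
        coords = svd.transform(vectorizer.transform(new_response))
        print("assigned cluster:", clusters.predict(coords)[0])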

    From byproduct to design factor: on validating the interpretation of process indicators based on log data

    International large-scale assessments such as PISA or PIAAC have started to provide public or scientific use files for log data; that is, events, event-related attributes, and timestamps of test-takers’ interactions with the assessment system. Log data and the process indicators derived from it can be used for many purposes. However, the intended uses and interpretations of process indicators require validation, which here means a theoretical and/or empirical justification that inferences about (latent) attributes of the test-taker’s work process are valid. This article reviews and synthesizes measurement concepts from various areas, including the standard assessment paradigm, the continuous assessment approach, the evidence-centered design (ECD) framework, and test validation. Based on this synthesis, we address the questions of how to ensure the valid interpretation of process indicators by means of an evidence-centered design of the task situation, and how to empirically challenge the intended interpretation of process indicators by developing and implementing correlational and/or experimental validation strategies. For this purpose, we explicate the process of reasoning from log data to low-level features and process indicators as the outcome of evidence identification. In this process, contextualizing information from log data is essential in order to reduce interpretative ambiguities regarding the derived process indicators. Finally, we show that empirical validation strategies can be adapted from classical approaches investigating the nomothetic span and construct representation. Two worked examples illustrate possible validation strategies for the design phase of measurements and their empirical evaluation.
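
    To make the step from raw log events to low-level features concrete, here is a small sketch under an assumed event format (timestamp, item, event type); it derives two typical process indicators, time on task and number of interactions, per item. The event names and log structure are hypothetical, not the PISA or PIAAC format.

        # Hedged sketch: evidence identification from log events to process indicators.
        from collections import defaultdict

        log = [  # (timestamp in seconds, item id, event type) -- assumed format
            (0.0, "item1", "item_start"),
            (3.2, "item1", "click"),
            (7.9, "item1", "text_input"),
            (12.5, "item1", "item_end"),
            (12.5, "item2", "item_start"),
            (20.1, "item2", "click"),
            (25.0, "item2", "item_end"),
        ]

        indicators = defaultdict(dict)
        for item in sorted({entry[1] for entry in log}):
            events = [e for e in log if e[1] == item]
            start = next(t for t, _, ev in events if ev == "item_start")
            end = next(t for t, _, ev in events if ev == "item_end")
            indicators[item]["time_on_task"] = end - start
            indicators[item]["n_actions"] = sum(
                ev not in ("item_start", "item_end") for _, _, ev in events
            )

        print(dict(indicators))
        # {'item1': {'time_on_task': 12.5, 'n_actions': 2},
        #  'item2': {'time_on_task': 12.5, 'n_actions': 1}}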

    Basic Arithmetical Skills of Students with Learning Disabilities in the Secondary Special Schools: An Exploratory Study covering Fifth to Ninth Grade

    The mission of German special schools is to enhance the education of students with Special Educational Needs in the area of Learning (SEN-L). However, recent studies indicate that students with SEN-L graduating from special schools show difficulties in basic arithmetical operations, and that the development of basic mathematical skills during secondary special school cannot be taken for granted. This study presents a newly developed test of basic arithmetical skills that builds on already established tests. The test examines the arithmetical skills of students with SEN-L from fifth to ninth grade. The sample consisted of 110 students from three special schools in Munich; testing took place in January and June 2013. The test proves to be an effective tool that reliably and precisely assesses students’ performance across different grades. The test items can be used without creating floor or ceiling effects among fifth- to ninth-grade students with SEN-L, and the items’ conformity to the dichotomous Rasch model is demonstrated. The students’ skills turn out to be very heterogeneous, both overall and within grades. Many of the students do not even master basic arithmetical skills that are taught in primary school, although achievement improves in higher grades.
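
    For reference, the dichotomous Rasch model against which item conformity was checked specifies the probability that person p solves item i as a function of the person's ability \theta_p and the item's difficulty \beta_i (standard formulation, not specific to this study):

        P(X_{pi} = 1 \mid \theta_p, \beta_i) = \frac{\exp(\theta_p - \beta_i)}{1 + \exp(\theta_p - \beta_i)}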

    Applying psychometric modeling to aid feature engineering in predictive log-data analytics. The NAEP EDM Competition

    The NAEP EDM Competition required participants to predict efficient test-taking behavior based on log data. This paper describes our top-down approach to engineering features by means of psychometric modeling, aiming at machine learning for the predictive classification task. For feature engineering, we employed, among others, the Log-Normal Response Time Model for estimating latent person speed and the Generalized Partial Credit Model for estimating latent person ability. Additionally, we adopted an n-gram feature approach for event sequences. Furthermore, instead of using the provided binary target label, we distinguished between inefficient test takers who were going too fast and those who were going too slow in order to train a multi-label classifier. Our best-performing ensemble classifier comprised three sets of low-dimensional classifiers, dominated by test-taker speed. While our classifier reached moderate performance relative to the competition leaderboard, our approach makes two important contributions. First, we show how classifiers that contain features engineered through literature-derived domain knowledge can provide meaningful predictions if results can be contextualized for test administrators who wish to intervene or take action. Second, our re-engineering of test scores enabled us to incorporate person ability into the models. However, ability was hardly predictive of efficient behavior, leading to the conclusion that the target label's validity needs to be questioned. Beyond competition-related findings, we furthermore report a state sequence analysis to demonstrate the viability of the employed tools. The latter yielded four different test-taking types that described distinctive differences between test takers, providing relevant implications for assessment practice. (DIPF/Orig.)
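
    The n-gram feature approach mentioned above can be illustrated with a short sketch: each test-taker's event sequence is converted into counts of consecutive event pairs (bigrams), which can then be fed to a standard classifier alongside the psychometric features. Event names and sequences are made up for illustration and do not reflect the NAEP log format.

        # Hedged sketch: bigram features from event sequences (toy event names).
        from collections import Counter

        sequences = {
            "taker_A": ["enter_item", "click_choice", "change_choice", "next_item"],
            "taker_B": ["enter_item", "next_item", "enter_item", "click_choice", "next_item"],
        }

        def bigram_features(events):
            """Count consecutive event pairs in one test-taker's sequence."""
            return Counter(zip(events, events[1:]))

        features = {taker: bigram_features(seq) for taker, seq in sequences.items()}
        print(features["taker_B"][("enter_item", "next_item")])  # -> 1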

    TROPOMI/S5P total ozone column data: global ground-based validation and consistency with other satellite missions

    In this work, the TROPOMI near-real-time (NRTI) and offline (OFFL) total ozone column (TOC) products are presented and compared to daily ground-based quality-assured Brewer and Dobson TOC measurements deposited in the World Ozone and Ultraviolet Radiation Data Centre (WOUDC). Additional comparisons to individual Brewer measurements from the Canadian Brewer Network and the European Brewer Network (Eubrewnet) are performed. Furthermore, twilight zenith-sky measurements obtained with ZSL-DOAS (Zenith Scattered Light Differential Optical Absorption Spectroscopy) instruments, which form part of the SAOZ network (Système d'Analyse par Observation Zénitale), are used for the validation. The quality of the TROPOMI TOC data is evaluated in terms of the influence of location, solar zenith angle, viewing angle, season, effective temperature, surface albedo, and clouds. For this purpose, globally distributed ground-based measurements have been utilized as the background truth. The overall statistical analysis of the global comparison shows that the mean bias and the mean standard deviation of the percentage difference between TROPOMI and ground-based TOC are within 0%–1.5% and 2.5%–4.5%, respectively. The mean bias that results from the comparisons is well within the S5P product requirements, while the mean standard deviation is very close to those limits, especially considering that the statistics shown here originate from both the satellite and the ground-based measurements. This research has been supported by the European Space Agency “Preparation and Operations of the Mission Performance Centre (MPC) for the Copernicus Sentinel-5 Precursor Satellite” (contract no. 4000117151/16/1-LG).
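
    The headline comparison statistic can be written down compactly: the percentage difference of satellite relative to ground-based TOC, summarized by its mean (the bias) and its standard deviation. The sketch below uses made-up Dobson-unit values purely to show the computation, not actual TROPOMI or WOUDC data.

        # Hedged sketch: mean bias and standard deviation of the percentage difference.
        import numpy as np

        tropomi_toc = np.array([301.2, 288.5, 312.0, 295.7])  # satellite TOC, DU (toy values)
        ground_toc = np.array([298.0, 290.1, 309.5, 294.0])   # collocated ground-based TOC, DU

        pct_diff = 100.0 * (tropomi_toc - ground_toc) / ground_toc
        print("mean bias: %.2f %%" % pct_diff.mean())
        print("standard deviation: %.2f %%" % pct_diff.std(ddof=1))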

    What to make of and how to interpret process data

    Maddox (2017) argues that respondents' talk and gesture during an assessment inform researchers how a response product has evolved. Indeed, how a task is performed represents key information for psychological and educational assessment. [...] Recently, process data have increasingly gained attention in cognitive ability testing, given the digitalization of measurement and the possibility of exploiting log file data. [...] As shown by Maddox for large-scale assessments, even talk and gesture can be regarded as useful process data. In this case, the process data are not only video-recorded but also observed by the interviewer in situ; the interviewer interactively uses them to influence the test-taking process and to reduce construct-irrelevant variance. Thus, like product data (e.g., scores), process data are used to draw inferences. We argue in the following that the interpretation and use of process data and derived indicators require validation, just as product data do (Kane, 2013). This theoretical background, including some examples about log file data, sets the ground for our comments on Maddox's use of "talk and gesture as process data." (DIPF/Orig.)